As a result of the pandemic, there has been a significant increase in the number of jobs offered online on various employment portals. However, not all job postings are legitimate, which undermines the credibility of the job posting websites. Fake postings are also a security threat, since scammers can use applicants' information to steal their identities. Using deep learning techniques, we try to predict whether a job posting is real or fake so that fraudulent posts can be filtered out early.
Only one dataset is available for our problem: it was collected by the University of the Aegean, is hosted on Kaggle, and has been used in many models and research papers.
Dataset Name: Dataset of real and fake job postings
Link: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
Size: 50MB
This dataset includes almost 18 thousand job postings, 866 of which are fake. It consists of 18 columns, including the job description text and a binary fraudulent label.
Our input is a string containing the job description, and the output is whether the job post is fraudulent or not.
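As a quick illustration, the sketch below loads the CSV and inspects the class balance; the file name "fake_job_postings.csv" and the "description"/"fraudulent" column names are taken from the Kaggle dataset, and the snippet is illustrative rather than part of our pipeline.

```python
# Minimal sketch: load the Kaggle CSV and check the class balance.
import pandas as pd

df = pd.read_csv("fake_job_postings.csv")

# Keep only the text we use as input and the binary target.
data = df[["description", "fraudulent"]].dropna(subset=["description"])

print(len(data), "postings")
print(data["fraudulent"].value_counts())  # roughly 17k legitimate vs. 866 fraudulent
```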
BERT: Bidirectional Encoder Representations from Transformers
BERT is one of the most powerful NLP models currently used in recent studies. It processes text bidirectionally, conditioning on context to both the left and the right of each token at the same time.
It also comes with models pre-trained on large amounts of data, which can be fine-tuned for tasks such as semantic labeling and, most importantly in our case, sentence classification.
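The snippet below is a minimal, illustrative fine-tuning step using the Hugging Face transformers library; the checkpoint name bert-base-uncased, the sample text, and the hyper-parameters are assumptions for illustration, not our exact training code.

```python
# Minimal sketch of one fine-tuning step for binary sentence classification with BERT.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["Work from home, no experience needed, send a registration fee to apply."]
labels = torch.tensor([1])  # 1 = fraudulent, 0 = legitimate (illustrative example)

# Tokenize to input IDs and attention masks, padded/truncated to a fixed length.
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

# One training step: the model returns the classification loss when labels are passed.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```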
Below are the state-of-the-art results achieved by several models on the same dataset we are using:
The original model used was a Bi-LSTM.
Our original model consisted of the following layers:
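Since the exact layer list is not reproduced here, the sketch below shows a representative Keras Bi-LSTM text classifier with GloVe-initialized embeddings; the layer sizes, dropout rate, and sequence length are illustrative assumptions, not our exact configuration.

```python
# Representative Keras Bi-LSTM classifier (illustrative sizes, not the exact layers).
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

vocab_size, embed_dim, max_len = 20000, 100, 200
glove_matrix = np.zeros((vocab_size, embed_dim))  # placeholder; filled from GloVe vectors in practice

model = Sequential([
    Embedding(vocab_size, embed_dim),  # initialized from GloVe below, then frozen
    Bidirectional(LSTM(64)),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),    # binary output: fraudulent or not
])
model.build(input_shape=(None, max_len))
model.layers[0].set_weights([glove_matrix])  # load the pre-trained GloVe vectors
model.layers[0].trainable = False            # keep the embeddings frozen
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```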
We achieved an accuracy of 96%; however, oversampling did not solve the overfitting issue.
Not every technique that typically addresses overfitting worked for us: we tried oversampling the data, and the model still overfit.
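For reference, the kind of oversampling we refer to simply replicates minority-class (fraudulent) examples until the classes are balanced; the sketch below uses scikit-learn's resample with illustrative variable names. Because the duplicated examples carry no new information, the network can memorize them, which is consistent with the overfitting we kept observing.

```python
# Sketch of random oversampling of the fraudulent class (illustrative, not our exact code).
import pandas as pd
from sklearn.utils import resample

def oversample(data: pd.DataFrame) -> pd.DataFrame:
    """Duplicate fraudulent postings until both classes have the same size."""
    real = data[data["fraudulent"] == 0]
    fake = data[data["fraudulent"] == 1]
    fake_upsampled = resample(fake, replace=True, n_samples=len(real), random_state=42)
    # Recombine and shuffle so duplicated rows are spread across batches.
    return pd.concat([real, fake_upsampled]).sample(frac=1, random_state=42)
```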
Since the model takes a long time to train, it was very difficult to do extensive hyper-parameter tuning or to explore variations in the model architecture.
Using GloVe embeddings, together with careful text cleaning and standard NLP preprocessing, had a strong positive effect on the results, so we recommend both.
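As a pointer, the sketch below shows typical cleaning steps and how pre-trained GloVe vectors are loaded; the specific regex rules and the file name glove.6B.100d.txt are illustrative assumptions rather than our exact preprocessing.

```python
# Sketch of text cleaning and GloVe loading (illustrative preprocessing choices).
import re
import numpy as np

def clean_text(text: str) -> str:
    """Lower-case and strip HTML, URLs, and non-letter characters from a posting."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags left in job posts
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()

def load_glove(path="glove.6B.100d.txt"):
    """Read a GloVe text file into a word -> vector dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype="float32")
    return vectors
```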